AITopics | fault tolerance

fault tolerance

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Fault-Tolerant MARL for CAVs under Observation Perturbations for Highway On-Ramp Merging

Shi, Yuchen, Pei, Huaxin, Zhang, Yi, Yao, Danya

arXiv.org Artificial Intelligence | Dec-1-2025

Multi-Agent Reinforcement Learning (MARL) holds significant promise for enabling cooperative driving among Connected and Automated Vehicles (CAVs). However, its practical application is hindered by a critical limitation, i.e., insufficient fault tolerance against observational faults. Such faults, which appear as perturbations in the vehicles' perceived data, can substantially compromise the performance of MARL-based driving systems. Addressing this problem presents two primary challenges. One is to generate adversarial perturbations that effectively stress the policy during training, and the other is to equip vehicles with the capability to mitigate the impact of corrupted observations. To overcome the challenges, we propose a fault-tolerant MARL method for cooperative on-ramp vehicles incorporating two key agents. First, an adversarial fault injection agent is co-trained to generate perturbations that actively challenge and harden the vehicle policies. Second, we design a novel fault-tolerant vehicle agent equipped with a self-diagnosis capability, which leverages the inherent spatio-temporal correlations in vehicle state sequences to detect faults and reconstruct credible observations, thereby shielding the policy from misleading inputs. Experiments in a simulated highway merging scenario demonstrate that our method significantly outperforms baseline MARL approaches, achieving near-fault-free levels of safety and efficiency under various observation fault patterns.
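
The self-diagnosis idea in this abstract, using correlations in a vehicle's state sequence to detect a faulty observation and substitute a credible one, can be illustrated with a much simpler stand-in. The sketch below is not the paper's learned agent: it assumes a constant-velocity motion model, and the function name `diagnose`, the step `dt`, and the threshold `tolerance` are hypothetical choices made only for illustration.

```python
def diagnose(positions, velocities, dt=0.1, tolerance=1.0):
    """Flag position readings that break temporal consistency and repair them.

    Each reading is compared with a constant-velocity prediction made from
    the last trusted position. Names and thresholds are illustrative only.
    """
    trusted = [positions[0]]   # assumption: the first reading is fault-free
    flags = [False]
    for t in range(1, len(positions)):
        predicted = trusted[-1] + velocities[t - 1] * dt
        faulty = abs(positions[t] - predicted) > tolerance
        flags.append(faulty)
        # A faulty reading is replaced by the prediction, so the policy
        # never sees the perturbed value.
        trusted.append(predicted if faulty else positions[t])
    return trusted, flags


positions = [0.0, 2.0, 4.0, 30.0, 8.0]   # the fourth reading is perturbed
velocities = [20.0] * 5                   # m/s, assumed reliable in this sketch
repaired, flags = diagnose(positions, velocities)
print(flags)      # which readings were judged faulty
print(repaired)   # the sequence handed on to the policy
```

A fixed threshold like this would be easy for an adversarial injector to evade with small perturbations, which is one reason to co-train the detector against an adversary as the abstract describes.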

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

arXiv:2511.23193

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States (0.04)

Genre: Research Report (0.82)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Federated Multi-Task Learning

Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, Ameet S. Talwalkar

Neural Information Processing Systems | Nov-21-2025, 09:57:23 GMT

Following [25, 36, 26], we summarize the unique challenges of federated learning below.

artificial intelligence, deep learning, machine learning, (15 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Virginia (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(3 more...)

Industry: Information Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

In-Place Zero-Space Memory Protection for CNN

Hui Guan, Lin Ning, Zhen Lin, Xipeng Shen, Huiyang Zhou, Seung-Hwan Lim

Neural Information Processing Systems | Nov-15-2025, 21:01:42 GMT

This paper introduces in-place zero-space ECC, assisted by a new training scheme, weight distribution-oriented training.

artificial intelligence, machine learning, protection, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Tennessee > Anderson County > Oak Ridge (0.04)
North America > United States > North Carolina > Wake County > Raleigh (0.04)
North America > Canada (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.93)

Industry: Government > Regional Government > North America Government > United States Government (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization

Liu, Fangxin, Wang, Zongwu, Xia, JinHong, Zhao, Junping, Zhao, Shouren, Li, Jinjin, Liu, Jian, Jiang, Li, Guan, Haibing

arXiv.org Artificial Intelligence | Oct-22-2025

The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths during each token generation. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss introduced. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
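
To make the role of Kullback-Leibler divergence in precision switching concrete, here is a minimal sketch that picks the lowest bit-width whose quantized output distribution stays close to the full-precision one. It is an illustration under stated assumptions, not FlexQuant's precision requirement model: for brevity it quantizes a logit vector rather than layer weights, and `pick_bits`, `candidates`, and `threshold` are hypothetical names and values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(xs, bits):
    """Symmetric uniform quantization with 2**(bits - 1) - 1 levels per sign."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in xs)
    if peak == 0:
        return list(xs)            # nothing to quantize
    scale = peak / levels
    return [round(x / scale) * scale for x in xs]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pick_bits(logits, candidates=(2, 3, 4, 8), threshold=1e-3):
    """Return the smallest candidate bit-width whose KL stays under threshold."""
    reference = softmax(logits)
    for bits in sorted(candidates):
        approx = softmax(quantize(logits, bits))
        if kl_divergence(reference, approx) <= threshold:
            return bits
    return max(candidates)         # fall back to the highest precision


print(pick_bits([0.3, -1.7, 2.4, 0.9]))
```

Tightening `threshold` pushes the choice toward higher precision; a per-token controller would re-run a check of this kind as the workload changes.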

large language model, machine learning, quantization, (18 more...)

arXiv.org Artificial Intelligence

arXiv:2506.12024

Country: Asia > China > Shanghai > Shanghai (0.05)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms

O'Keeffe, James

arXiv.org Artificial Intelligence | Oct-10-2025

An active approach to fault tolerance is essential for robot swarms to achieve long-term autonomy. Previous efforts have focused on responding to spontaneous electro-mechanical faults and failures. However, many faults occur gradually over time. This work argues that the principles of predictive maintenance, in which potential faults are resolved before they hinder the operation of the swarm, offer a promising means of achieving long-term fault tolerance. This is a novel approach to swarm fault tolerance, which is shown to give a comparable or improved performance when tested against a reactive approach in almost all cases tested. However, a significant barrier to the deployment of autonomous robots in many real-world applications is the risk of failure or loss of autonomous control in the field.

artificial intelligence, fault tolerance, robot, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LRA.2025.3592063

arXiv:2504.01594

Country: Europe > United Kingdom (0.04)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

A Three-Level Whole-Body Disturbance Rejection Control Framework for Dynamic Motions in Legged Robots

Li, Bolin, Zuo, Gewei, Wang, Zhixiang, Ke, Xiaotian, Zhu, Lijun, Ding, Han

arXiv.org Artificial Intelligence | Aug-28-2025

This paper presents a control framework designed to enhance the stability and robustness of legged robots in the presence of uncertainties, including model uncertainties, external disturbances, and faults. The framework enables the full-state feedback estimator to estimate and compensate for uncertainties in the whole-body dynamics of the legged robots. First, we propose a novel moving horizon extended state observer (MH-ESO) to estimate uncertainties and mitigate noise in legged systems, which can be integrated into the framework for disturbance compensation. Second, we introduce a three-level whole-body disturbance rejection control framework (T-WB-DRC). Unlike the previous two-level approach, this three-level framework considers both the plan based on whole-body dynamics without uncertainties and the plan based on dynamics with uncertainties, significantly improving payload transportation, external disturbance rejection, and fault tolerance. Third, simulations of both humanoid and quadruped robots in the Gazebo simulator demonstrate the effectiveness and versatility of T-WB-DRC.

Note to Practitioners: This paper presents a practical control framework to significantly improve the robustness of legged robots against real-world uncertainties like unknown payloads, external pushes, and actuator faults. Its core is a novel three-level whole-body controller (T-WB-DRC) that uses a moving horizon estimator (MH-ESO) to accurately identify and compensate for disturbances in real-time. This dual-planning approach, which considers both ideal and disturbance-injected dynamics, outperforms previous methods. The framework's effectiveness in enhancing stability under disturbances has been successfully validated through extensive simulations and physical experiments on a quadruped robot.

artificial intelligence, legged robot, robot, (18 more...)

arXiv.org Artificial Intelligence

arXiv:2508.13531

Country:

Asia > China > Hubei Province > Wuhan (0.05)
Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)

FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention

Dai, Huangliang, Wu, Shixun, Huang, Jiajun, Jian, Zizhe, Zhu, Yue, Hu, Haiyang, Chen, Zizhong

arXiv.org Artificial Intelligence | Aug-14-2025

Transformer models rely on High-Performance Computing (HPC) resources for inference, where soft errors are inevitable in large-scale systems, making the reliability of the model particularly critical. Existing fault tolerance frameworks for Transformers are designed at the operation level without architectural optimization, leading to significant computational and memory overhead, which in turn reduces protection efficiency and limits scalability to larger models. In this paper, we implement module-level protection for Transformers by treating the operations within the attention module as a single kernel and applying end-to-end fault tolerance. This method provides unified protection across multi-step computations, while achieving comprehensive coverage of potential errors in the nonlinear computations. For linear modules, we design a strided algorithm-based fault tolerance (ABFT) that avoids inter-thread communication. Experimental results show that our end-to-end fault tolerance achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.
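
For readers unfamiliar with algorithm-based fault tolerance, the sketch below shows the classic checksum form of ABFT for a single matrix product: row and column sums of the result are predicted from the inputs and compared with the sums actually observed. It is not the strided variant this paper designs; `abft_matmul`, `tol`, and the `inject` parameter (used here to simulate a soft error) are hypothetical names introduced for the example.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def abft_matmul(a, b, tol=1e-9, inject=None):
    """Compute c = a @ b and report rows/columns whose checksums disagree.

    inject: optional (i, j, delta) that corrupts c[i][j] to mimic a soft error.
    """
    c = matmul(a, b)
    if inject is not None:
        i, j, delta = inject
        c[i][j] += delta

    # Column checksum: (1^T A) B must equal 1^T C.
    ones_a = [[sum(col) for col in zip(*a)]]
    expected_cols = matmul(ones_a, b)[0]
    actual_cols = [sum(col) for col in zip(*c)]

    # Row checksum: A (B 1) must equal C 1.
    b_ones = [[sum(row)] for row in b]
    expected_rows = [row[0] for row in matmul(a, b_ones)]
    actual_rows = [sum(row) for row in c]

    bad_rows = [i for i, (e, x) in enumerate(zip(expected_rows, actual_rows))
                if abs(e - x) > tol]
    bad_cols = [j for j, (e, x) in enumerate(zip(expected_cols, actual_cols))
                if abs(e - x) > tol]
    return c, bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
_, rows, cols = abft_matmul(a, b, inject=(1, 0, 5))
print(rows, cols)   # the intersection locates the corrupted element
```

A single corrupted element shows up as exactly one bad row and one bad column, so it can be located and, by subtracting the checksum difference, corrected. The checks cost two extra matrix-vector products per multiplication, which is the kind of per-operation overhead that module-level schemes aim to reduce.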

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

arXiv:2504.02211

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > Riverside County > Riverside (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers

Titopoulos, Vasileios, Alexandridis, Kosmas, Dimitrakopoulos, Giorgos

arXiv.org Artificial Intelligence | Jul-23-2025

Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of query, key and value matrices, of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
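
A single scalar check over an attention output is possible because the softmax weights are applied linearly to the value matrix: the sum of all entries of softmax(QK^T)V equals the weights applied to the row sums of V. The sketch below demonstrates only that identity in software. It is not Flash-ABFT's online hardware scheme, it omits the usual 1/sqrt(d) scaling, and it catches faults in the weighted accumulation and output rather than faults that corrupt the softmax weights before both paths use them. `checksum_ok` and `tol` are hypothetical names.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Unscaled softmax(Q K^T) V; returns the output and the weights."""
    scores = [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    width = len(v[0])
    out = [[sum(w * vj[c] for w, vj in zip(wi, v)) for c in range(width)]
           for wi in weights]
    return out, weights


def checksum_ok(out, weights, v, tol=1e-9):
    """Compare the sum of all outputs with the value predicted from V's row sums."""
    v_row_sums = [sum(row) for row in v]
    predicted = sum(w * s for wi in weights for w, s in zip(wi, v_row_sums))
    actual = sum(sum(row) for row in out)
    return abs(predicted - actual) <= tol


q = [[1.0, 0.0], [0.5, 0.5]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(q, k, v)
print(checksum_ok(out, weights, v))   # fault-free run
out[0][1] += 0.5                      # simulate a fault in one output element
print(checksum_ok(out, weights, v))   # the single check now fails
```

Because only one scalar is compared, the check detects that a fault occurred but does not locate it.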

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

arXiv:2507.16676

Country: Europe > Greece (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Semiconductors & Electronics (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Fault-Tolerant Multi-Robot Coordination with Limited Sensing within Confined Environments

Aina, Kehinde O., Bagheri, Hosain, Goldman, Daniel I.

arXiv.org Artificial Intelligence | May-22-2025

As robots are increasingly deployed to collaborate on tasks within shared workspaces and resources, the failure of an individual robot can critically affect the group's performance. This issue is particularly challenging when robots lack global information or direct communication, relying instead on social interaction for coordination and to complete their tasks. In this study, we propose a novel fault-tolerance technique leveraging physical contact interactions in multi-robot systems, specifically under conditions of limited sensing and spatial confinement. We introduce the "Active Contact Response" (ACR) method, where each robot modulates its behavior based on the likelihood of encountering an inoperative (faulty) robot. Active robots are capable of collectively repositioning stationary and faulty peers to reduce obstructions and maintain optimal group functionality. We implement our algorithm in a team of autonomous robots, equipped with contact-sensing and collision-tolerance capabilities, tasked with collectively excavating cohesive model pellets. Experimental results indicate that the ACR method significantly improves the system's recovery time from robot failures, enabling continued collective excavation with minimal performance degradation. Thus, this work demonstrates the potential of leveraging local, social, and physical interactions to enhance fault tolerance and coordination in multi-robot systems operating in constrained and extreme environments.

artificial intelligence, faulty robot, robot, (17 more...)

arXiv.org Artificial Intelligence

arXiv:2505.15036

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)